Abstract

Independent (uncoordinated) checkpointing for parallel and distributed systems allows
maximum process autonomy but suffers from possible domino effects and the associated
storage space overhead for maintaining multiple checkpoints and message logs. In most research
on checkpointing and recovery it has been assumed that only the checkpoints and message
logs older than the global recovery line can be discarded. We show in this paper how
recovery line transformation and decomposition can be applied to the problem of efficiently
identifying all discardable message logs, thereby achieving optimal garbage collection.
Communication trace-driven simulation for several parallel programs is used to show the benefits
of the proposed algorithm for message log reclamation.
1 Introduction


Numerous checkpointing and rollback recovery techniques have been proposed in the
literature for parallel and distributed systems. They can be classified into three primary
categories. Coordinated checkpointing schemes [1-5] synchronize computation with
checkpointing by coordinating processors during a checkpointing session in order to maintain a
consistent set of checkpoints. Each processor only keeps the most recent successful
checkpoint, and rollback propagation is avoided at the cost of potentially significant performance
degradation during normal execution. Loosely-synchronized checkpointing schemes [6-8]
reduce the coordination overhead by taking advantage of loosely-synchronized checkpointing
clocks and by bounding the message transmission delay. Independent checkpointing schemes
[9-19] replace the checkpoint synchronization by dependency tracking and possibly message
logging in order to allow maximum process autonomy. Rollback propagation is managed by
searching for a consistent system state based on the dependency information. Process
autonomy during normal execution is preserved by either allowing slower recovery or assuming
a piecewise deterministic execution model [15]. Typically, each processor has to maintain
multiple checkpoints and message logs to ensure successful recovery.

This paper considers independent checkpointing schemes for nondeterministic execution
[10]. Most research on this subject has concentrated on algorithms for finding the latest
consistent set of checkpoints, i.e., the recovery line, during rollback recovery. The same
algorithms can be applied to the set of existing checkpoints during normal execution to
determine the global recovery line². All the checkpoints and message logs older than the
global recovery line then become obsolete and can therefore be discarded. Based on the
observation that some of the non-obsolete checkpoints can also be discarded, we previously
derived the necessary and sufficient conditions for a checkpoint to be non-discardable [20].
Let N be the number of processors; it was shown that there exists a set of N recovery lines
which contains all the checkpoints possibly useful for any future recovery. We will show in
this paper how to identify all discardable message logs in order to further reduce the space
overhead³ for systems with message logging in addition to checkpointing [12].

² The global recovery line can be used for recovery when the entire system crashes. A local recovery line
is used when a subset of processors needs to roll back [9].

The outline of the paper is as follows. Section 2 describes the checkpointing and recovery 
protocol and the technique of recovery line transformation and decomposition. Section 3 
derives the necessary and sufficient conditions for identifying all discardable message logs 
and the experimental evaluation is described in Section 4. 


2 Checkpointing Protocol and Recovery Lines 

2.1 Checkpointing and Recovery Protocol 

The system model considered in this paper consists of a number of concurrent processes 
for which all process communication is through message passing. Processes are assumed to 
run on fail-stop processors [21] and each processor is considered as an individual recovery 
unit [13]. We do not assume deterministic execution or the existence of any mechanism for 
detecting and recording internal nondeterministic events [19, 22]. Consequently, if the sender 
of a message is rolled back, the corresponding message log will be invalid during reexecution, 
which means the receiver also has to be rolled back in order to undo the effect of the message. 

³ A simple sufficient condition based on local information exists for identifying some discardable messages
before they are logged [12]. This paper considers the necessary and sufficient conditions based on global
information for identifying all discardable logged messages.

During normal execution, the state of each processor is periodically saved as a checkpoint
on stable storage. Let $CP_{i,k}$ denote the kth checkpoint of processor $p_i$, with $k \ge 0$ and
$0 \le i \le N - 1$, where N is the number of processors. A checkpoint interval is defined to
be the time between two consecutive checkpoints on the same processor, and the interval
between $CP_{i,k}$ and $CP_{i,k+1}$ is called the kth checkpoint interval. Each message is tagged
with the current checkpoint ordinal number and the processor number of the sender. Each
processor takes its checkpoint independently and updates the direct dependency information
table (or input table [10]) as follows: if at least one message from the mth checkpoint interval
of processor $p_j$ had been processed during the previous checkpoint interval, the pair (j, m)
is added to the table entry for the new checkpoint.
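
The table update described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the protocol's actual implementation: the `Processor` class, its method names, and the convention that ordinal 0 denotes the initial state are all introduced here for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    """Hypothetical model of one processor's direct dependency tracking."""
    pid: int
    ordinal: int = 0  # ordinal of the latest checkpoint (0 = initial state, an assumption)
    seen: set = field(default_factory=set)     # (j, m) tags processed in the current interval
    table: dict = field(default_factory=dict)  # checkpoint ordinal -> set of (j, m) pairs

    def tag(self):
        """Tag attached to every outgoing message: (sender number, current ordinal)."""
        return (self.pid, self.ordinal)

    def process(self, tag):
        """Record the tag of a message processed during the current interval."""
        self.seen.add(tag)

    def take_checkpoint(self):
        """Take the next checkpoint and store the pairs gathered since the previous one."""
        self.ordinal += 1
        self.table[self.ordinal] = self.seen
        self.seen = set()
        return self.ordinal


p0, p1 = Processor(0), Processor(1)
p1.process(p0.tag())   # p1 processes a message sent in interval 0 of p0
p1.take_checkpoint()   # the entry for checkpoint 1 of p1 receives the pair (0, 0)
```

Storing pairs in a set reflects the "at least one message" wording: several messages from the same interval of the same sender contribute a single pair.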

A centralized garbage collection algorithm can be invoked by any processor periodically to 
reduce the space overhead. First, the dependency information for all existing checkpoints is 
collected to construct the checkpoint graph [9] (Fig. 1(b)). The rollback propagation algorithm 
[9] shown in Fig. 2 is executed on the checkpoint graph to determine the global recovery line 
according to the definition of consistency described later. All the checkpoints and message 
logs before the global recovery line then become obsolete and their space can therefore be 
reclaimed. The same procedure can also be invoked by any processor which initiates a
rollback to determine the local recovery line. The only differences are that each surviving processor
takes an additional virtual checkpoint [10], so that the dependency information during the
current checkpoint interval is also included in the checkpoint graph (called the extended
checkpoint graph [9]), and that each processor rolls back to the appropriate checkpoint when
it is informed of the local recovery line.

Two situations need to be considered for checkpoint consistency. In Fig. 3(a), $CP_{i,k}$ and
$CP_{j,m}$ are inconsistent because of the orphan message [8] $M_a$, or equivalently because $CP_{j,m}$
happened before [24] $CP_{i,k}$. In Fig. 3(b), the message $M_b$ is an in-transit message, i.e.,
recorded as “sent but not yet received”, with respect to the system state containing $CP_{i,k}$
and $CP_{j,m}$. It has been shown [1, 7] that checkpoints like $CP_{i,k}$ and $CP_{j,m}$ can be considered
consistent if $M_b$ is logged. Pessimistic (synchronous) message logging protocols [25-27] can
ensure such a message is always properly recorded at the receiving end. This is also true for
an optimistic logging protocol if the inclusion of a new checkpoint in the checkpoint graph
is properly delayed based on the message logging progress [12]. As a result, we consider the
situation in Fig. 3(b) as consistent.
Figure 1: Example checkpoint graph: (a) the checkpoint and communication pattern; (b)
the corresponding checkpoint graph, with each directed edge representing a happened before
relation.


/* CP stands for checkpoint. Initially, all the CPs are unmarked */
include the latest CP of each processor in the root set;
mark all CPs strictly reachable [23] from any CP in the root set;
while (at least one CP in the root set is marked) {
    replace each marked CP in the root set by the latest unmarked CP on the
        same processor;
    mark all CPs strictly reachable from any CP in the root set;
}
the root set is the recovery line.


Figure 2: The rollback propagation algorithm. 
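
A runnable sketch of the procedure in Fig. 2 is given below. It rests on several assumptions that are not stated in the text: "strictly reachable" is read as reachable by a path of at least one edge, the graph contains an edge from each checkpoint to its successor on the same processor, checkpoint 0 of every processor is an initial state with no incoming edges, and vertices are represented as `(processor, ordinal)` tuples. The function name `recovery_line` and the sample dependency edges are hypothetical.

```python
def recovery_line(num_ckpts, deps):
    """Rollback propagation on a checkpoint graph.

    num_ckpts[i] -- number of checkpoints on processor i (ordinals 0 .. num_ckpts[i]-1)
    deps         -- iterable of ((i, k), (j, m)) edges: CP(i,k) happened before CP(j,m)
    Returns a dict mapping each processor to the ordinal of its checkpoint on the line.
    """
    succ = {(i, k): set() for i, n in enumerate(num_ckpts) for k in range(n)}
    for i, n in enumerate(num_ckpts):
        for k in range(n - 1):
            succ[(i, k)].add((i, k + 1))  # assumed same-processor ordering edge
    for a, b in deps:
        succ[a].add(b)

    marked = set()

    def mark_reachable(roots):
        # Mark everything reachable from the roots by at least one edge.
        stack = [v for r in roots for v in succ[r]]
        while stack:
            v = stack.pop()
            if v not in marked:
                marked.add(v)
                stack.extend(succ[v])

    root = {i: n - 1 for i, n in enumerate(num_ckpts)}  # latest CP of each processor
    mark_reachable(root.items())
    while any(r in marked for r in root.items()):
        for i in root:
            while (i, root[i]) in marked:  # fall back to the latest unmarked CP
                root[i] -= 1
        mark_reachable(root.items())
    return root


# Hypothetical pattern: two processors with three checkpoints each.
deps = [((0, 2), (1, 2)), ((1, 1), (0, 2))]
line = recovery_line([3, 3], deps)
```

In this sample, the latest checkpoint of processor 1 is first invalidated by the edge from checkpoint 2 of processor 0, and the fallback on processor 1 then invalidates checkpoint 2 of processor 0 in turn, a small instance of rollback propagation.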



Figure 3: Checkpoint consistency: (a) orphan message $M_a$; (b) in-transit message $M_b$.
2.2 Recovery Line Transformation and Decomposition


We define a global checkpoint as a set of N checkpoints, one from each processor. Based on
the previous description of checkpoint consistency, a consistent global checkpoint is a set of N 
checkpoints, one from each processor and no two of which are related through the happened 
before relation. A recovery line refers to the latest available consistent global checkpoint. 
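
The consistency condition can be checked mechanically, as the following sketch suggests. It assumes that the happened before relation is the transitive closure of the checkpoint graph edges, that consecutive checkpoints on one processor are linked by an edge, and that vertices are `(processor, ordinal)` tuples; the name `is_consistent` and the sample edges are hypothetical.

```python
def is_consistent(global_ckpt, num_ckpts, deps):
    """Return True if no two members of global_ckpt are related by happened before.

    global_ckpt  -- dict processor -> ordinal, one checkpoint per processor
    num_ckpts[i] -- number of checkpoints on processor i
    deps         -- iterable of ((i, k), (j, m)) edges of the checkpoint graph
    """
    succ = {(i, k): set() for i, n in enumerate(num_ckpts) for k in range(n)}
    for i, n in enumerate(num_ckpts):
        for k in range(n - 1):
            succ[(i, k)].add((i, k + 1))  # assumed same-processor ordering edge
    for a, b in deps:
        succ[a].add(b)

    members = set(global_ckpt.items())
    for start in members:
        seen, stack = set(), list(succ[start])
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            if v in members:
                return False  # one member happened before another
            seen.add(v)
            stack.extend(succ[v])
    return True


deps = [((0, 2), (1, 2)), ((1, 1), (0, 2))]
ok = is_consistent({0: 1, 1: 1}, [3, 3], deps)
bad = is_consistent({0: 2, 1: 2}, [3, 3], deps)
```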

Note that being obsolete is simply a sufficient condition for being discardable. Our goal 
is to derive the necessary and sufficient conditions for identifying all discardable checkpoints 
and message logs. A checkpoint is non-discardable if and only if it can possibly belong to a 
future recovery line, and a message log is non-discardable if and only if it can possibly become 
an in-transit message with respect to a future recovery line. (For ease of presentation,
if a message M is an in-transit message with respect to a recovery line L, we will say that M
intersects L or that the dependency edge corresponding to M intersects L.) The difficulty comes
from the fact that there are an infinite number of possible future recovery lines. Therefore, 
our first step is to find a finite set of recovery lines, which suffices for the purpose of optimal 
garbage collection. 

An operational session [10] is the interval between the start of normal execution and the 
instance of error recovery. Between two consecutive operational sessions is a recovery session. 
The entire program execution can be viewed as consisting of several operational sessions and 
recovery sessions. Within an operational session, new vertices are added to the checkpoint
graph and cannot have any outgoing edges to any existing vertices⁴. (If a graph G' can
be obtained by adding new vertices to another graph G in this way, G' is called a potential
supergraph of G.) Within a recovery session, existing vertices after the local recovery line are
removed from the checkpoint graph. The above rules for checkpoint graph evolution then 
determine the possible future checkpoint graphs, and therefore the future recovery lines. 

We first define a set of $2^N$ immediate potential supergraphs, which are the supergraphs
of $G$ and the subgraphs of $\hat{G}$, as shown in Fig. 4. $\hat{G}$ is constructed by adding an n-node $n_i$,
with a single incoming edge, at the end for each process $p_i$. Let $U$ denote the set of all such
n-nodes and $\mathcal{RL}(G)$ denote the recovery line of a checkpoint graph $G$. The recovery line
transformation procedure first transforms every possible future recovery line of $G$ backwards
in time into the recovery line of one of $G$'s $2^N$ immediate potential supergraphs. The recovery
line decomposition procedure then further reduces this set of $2^N$ recovery lines
$\{\mathcal{RL}(\hat{G} - W) : W \subseteq U\}$ to the set of N recovery lines $\{\mathcal{RL}(\hat{G} - n_i) : n_i \in U\}$. We will describe the
transformation and decomposition procedures by using the example in Fig. 5. Formal proofs
can be found in [20].

⁴ Vertices with incoming edges from not-yet-collected vertices are temporarily excluded from the
checkpoint graph.

Figure 4: The immediate potential supergraphs.

Suppose $G$ in Fig. 5(a) is the current checkpoint graph considered for garbage collection.
Fig. 5(b) shows the extended checkpoint graph when $p_3$ later initiates the first rollback, and
$G_c$ is the checkpoint graph immediately after the recovery. Fig. 5(d) shows another possible
extended checkpoint graph when $p_0$ initiates a second rollback. We now describe how to
transform and decompose $\mathcal{RL}(G_d)$, a typical future recovery line of $G$.

Transformation within an operational session: First we consider $G_c$ and $G_d$, where $G_c$
is the starting checkpoint graph of a new operational session and $G_d$ is a potential supergraph
of $G_c$. For checkpoints X, Y and Z, which belong to $\mathcal{RL}(G_d)$ but are not in graph $G_c$, we
replace them by their corresponding n-nodes P, Q and R for $G_c$ as shown in Fig. 5(g).
$\mathcal{RL}(G_d) = \{A, B, X, Y, Z\}$ is then transformed into $\mathcal{RL}(G_g) = \{A, B, P, Q, R\}$, where $G_g$ is
an immediate potential supergraph of $G_c$.

Figure 5: Example recovery line transformation.

Transformation across consecutive operational sessions: Now we consider $G_g$ and
$G_b$, the last checkpoint graph of the first operational session. Of the three n-nodes P, Q and
R in $\mathcal{RL}(G_g)$, only Q and R come from the processors which were rolled back during the
first recovery. We replace them by C and D, the corresponding checkpoints which were on
the local recovery line. $\mathcal{RL}(G_g)$ is then transformed into $\mathcal{RL}(G_f) = \{A, B, P, C, D\}$. Notice
that $G_f$ is an immediate potential supergraph of $G_b$ and is therefore a potential supergraph
of $G$. By repeatedly and alternately applying the above two transformation procedures,
every future recovery line can be transformed into another recovery line in the following set:
$\{\mathcal{RL}(\hat{G} - W) : W \subseteq U\}$.

Recovery line decomposition: Let min(S) denote the set of minimal elements, i.e.,
vertices with no incoming edges, of S. By utilizing the lattice properties of the
maximum-sized antichains on a partially ordered set [24,28], each of the $2^N$ recovery lines can be
decomposed as:

$$\mathcal{RL}(\hat{G} - W) = \min\Big( \bigcup_{n_i \in W} \mathcal{RL}(\hat{G} - n_i) \Big). \qquad (1)$$

For example, the recovery line of $G_e = \hat{G} - \{n_0, n_1, n_3, n_4\}$ in Fig. 5(e) has the following
decomposition (refer to Fig. 6):

$$\mathcal{RL}(G_e) = \min\big(\mathcal{RL}(\hat{G} - n_0) \cup \mathcal{RL}(\hat{G} - n_1) \cup \mathcal{RL}(\hat{G} - n_3) \cup \mathcal{RL}(\hat{G} - n_4)\big)$$

$$= \min(\{A, B, n_2, n_3, n_4, n_0, I, n_1, J, C, D\}) = \{A, B, n_2, C, D\}.$$

3 Message Log Reclamation 


By using the techniques described in the previous section, it has been shown that the set
of all non-discardable checkpoints is equal to the union of the N recovery lines $\mathcal{RL}(\hat{G} - n_i)$,
$n_i \in U$ (except for the $n_i$'s) [20]. For the example shown in Fig. 6, while all the checkpoints in
$G$ are non-obsolete, only those checkpoints corresponding to the shaded vertices in Fig. 6(f)
are non-discardable.

In addition to the checkpoints, message logs⁵ constitute another storage space overhead
[12]. By following the transformation and decomposition procedures, we will show in the
following that a message log is non-discardable, i.e., can possibly intersect a future recovery
line, if and only if it intersects one of the $\mathcal{RL}(\hat{G} - n_i)$'s.

3.1 Recovery Line Transformation and Decomposition 

Instead of considering each individual message, we use its corresponding edge in the
checkpoint graph for our discussion. Let (a, b) represent the directed edge starting at vertex a
and pointing to vertex b. Clearly, (a, b) intersects a recovery line $\mathcal{RL}(G)$ if a is on the left
hand side of $\mathcal{RL}(G)$ and b is on the right hand side of $\mathcal{RL}(G)$.

LEMMA 1 If (a, b) can possibly intersect a future recovery line, (a, b) must intersect
$\mathcal{RL}(\hat{G} - W)$ for some $W \subseteq U$.

Sketch of the proof. Again, we use the example in Fig. 5. The edge (E, F) in $G$ can
intersect a possible future recovery line $\mathcal{RL}(G_d)$. We will show that (E, F) must also intersect
$\mathcal{RL}(G_e)$.

Transformation within an operational session: First consider $G_c$, $\mathcal{RL}(G_g)$ and $\mathcal{RL}(G_d)$.
Any vertex of $G_c$ which is on the left (right) hand side of $\mathcal{RL}(G_d)$ must remain on the left
(right) hand side of $\mathcal{RL}(G_g)$. Therefore, any edge of $G_c$ intersecting $\mathcal{RL}(G_d)$, for example
(E, F), must also intersect $\mathcal{RL}(G_g)$ after the recovery line transformation.

Transformation across consecutive operational sessions: Now consider $G_c$, $\mathcal{RL}(G_g)$
and $\mathcal{RL}(G_f)$. All vertices of $G_c$ which are on the right hand side of $\mathcal{RL}(G_g)$ must remain on
the right hand side of $\mathcal{RL}(G_f)$ because the transformation can only push the recovery line
to the left. Those on the left hand side of $\mathcal{RL}(G_g)$ remain on the left hand side of $\mathcal{RL}(G_f)$
except for C and D. However, C and D cannot have any outgoing edges in $G_c$ because they
were part of the local recovery line and therefore all such edges must have been removed
during the recovery. Hence, any edge of $G_c$ intersecting $\mathcal{RL}(G_g)$, for example (E, F), must
also intersect $\mathcal{RL}(G_f)$ after the transformation.

⁵ The message logs considered in this paper are used for recording the state of the channels [1] instead of
replaying for deterministic state reconstruction [13].

Finally, we can show that (E, F) also intersects $\mathcal{RL}(G_e)$ by again applying the
transformation within an operational session. □

LEMMA 2 $\min(\bigcup_{n_i \in W} \mathcal{RL}(\hat{G} - n_i))$ in Eq. (1) is equivalent to the set of the N leftmost
checkpoints, one from each processor, among the checkpoints in the union.

Proof. If a checkpoint v of $p_i$ is not the leftmost checkpoint of $p_i$ in the union, then v
cannot be a minimal element because there exists at least one checkpoint on its left. Conversely,
if v is the leftmost checkpoint of $p_i$, v must be in $\min(\bigcup_{n_i \in W} \mathcal{RL}(\hat{G} - n_i))$ because there
are only N such checkpoints and $\mathcal{RL}(\hat{G} - W) = \min(\bigcup_{n_i \in W} \mathcal{RL}(\hat{G} - n_i))$ must consist of N
checkpoints. □
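
Lemma 2 turns the min operation of Eq. (1) into a per-processor minimum, which the following sketch illustrates. Representing a recovery line as a dict from processor number to checkpoint ordinal is an assumption made here, and the function name and the two sample lines are hypothetical.

```python
def leftmost_per_processor(lines):
    """Combine recovery lines by keeping, for each processor, its leftmost checkpoint.

    lines -- iterable of recovery lines, each a dict processor -> checkpoint ordinal
    """
    merged = {}
    for line in lines:
        for proc, ordinal in line.items():
            if proc not in merged or ordinal < merged[proc]:
                merged[proc] = ordinal
    return merged


# Two hypothetical recovery lines for a two-processor system.
line_a = {0: 1, 1: 1}
line_b = {0: 3, 1: 2}
combined = leftmost_per_processor([line_a, line_b])
```

Under this representation the combination costs time linear in the total number of checkpoints on the input lines, with no graph traversal needed.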

LEMMA 3 If (a, b) intersects $\mathcal{RL}(\hat{G} - W)$ for some $W \subseteq U$, (a, b) must intersect
$\mathcal{RL}(\hat{G} - n_i)$ for some $n_i \in U$.

Proof. Suppose (a, b) does not intersect any of the N recovery lines $\mathcal{RL}(\hat{G} - n_i)$, $n_i \in U$.
Then each of the N recovery lines must lie either entirely on the right hand side of (a, b) or
entirely on the left hand side of it.

Recovery line decomposition: Given any of the $2^N$ recovery lines $\mathcal{RL}(\hat{G} - W)$, $W \subseteq U$,
if all $\mathcal{RL}(\hat{G} - n_i)$'s, $n_i \in W$, are entirely on the right hand side of (a, b), $\mathcal{RL}(\hat{G} - W)$ must
also lie on the right hand side of (a, b) by Eq. (1) and Lemma 2; if at least one $\mathcal{RL}(\hat{G} - n_i)$,
$n_i \in W$, lies entirely on the left hand side of (a, b), $\mathcal{RL}(\hat{G} - W)$ will be on the left hand
side of (a, b), again by Lemma 2. Therefore, we have shown that (a, b) cannot intersect
any $\mathcal{RL}(\hat{G} - W)$ if it does not intersect any $\mathcal{RL}(\hat{G} - n_i)$. Equivalently, if (a, b) intersects
$\mathcal{RL}(\hat{G} - W)$ for some $W \subseteq U$, (a, b) must intersect $\mathcal{RL}(\hat{G} - n_i)$ for some $n_i \in U$. □




3.2 The Algorithm 

We now state the necessary and sufficient conditions for a message log to be non-discardable. 

THEOREM 1 A message log is non-discardable if and only if its corresponding edge in the
checkpoint graph intersects $\mathcal{RL}(\hat{G} - n_i)$ for some $n_i \in U$.

Proof. The only if part follows immediately from Lemmas 1 and 3. The if part comes
from the fact that every $\mathcal{RL}(\hat{G} - n_i)$ is also a possible future recovery line. □

Theorem 1 also gives the algorithm for finding all non-discardable message logs: first
compute the N recovery lines $\mathcal{RL}(\hat{G} - n_i)$, $n_i \in U$; only those message logs with their
corresponding edges intersecting any of the N recovery lines are non-discardable. In Fig. 6,
the edge (E, F) intersects $\mathcal{RL}(\hat{G} - n_0)$, (G, H) intersects $\mathcal{RL}(\hat{G} - n_4)$ and none of the edges
intersects $\mathcal{RL}(\hat{G} - n_1)$, $\mathcal{RL}(\hat{G} - n_2)$ or $\mathcal{RL}(\hat{G} - n_3)$. Therefore, although all the edges in
Fig. 6(f) are non-obsolete, only those message logs corresponding to (E, F) and (G, H) need
to be retained.
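
The two steps of this algorithm might be sketched as below. The sketch is an illustration under assumptions, not the authors' implementation: vertices are `(processor, ordinal)` tuples, checkpoint 0 is an initial state with no incoming edges, each n-node is modelled as one extra vertex appended after the last checkpoint of its processor, and an edge from checkpoint k of processor i to checkpoint m of processor j is taken to intersect a line when k lies strictly before the line on processor i and m strictly after it on processor j. The helper `recovery_line` is a compact restatement of the procedure in Fig. 2; all names and the sample edges are hypothetical.

```python
def recovery_line(num_ckpts, deps):
    """Rollback propagation (Fig. 2) on vertices (processor, ordinal)."""
    succ = {(i, k): set() for i, n in enumerate(num_ckpts) for k in range(n)}
    for (i, k) in list(succ):
        if (i, k + 1) in succ:
            succ[(i, k)].add((i, k + 1))  # assumed same-processor ordering edge
    for a, b in deps:
        succ[a].add(b)
    marked = set()
    root = {i: n - 1 for i, n in enumerate(num_ckpts)}
    while True:
        stack = [v for r in root.items() for v in succ[r]]
        while stack:
            v = stack.pop()
            if v not in marked:
                marked.add(v)
                stack.extend(succ[v])
        if not any(r in marked for r in root.items()):
            return root
        for i in root:
            while (i, root[i]) in marked:
                root[i] -= 1


def non_discardable_logs(num_ckpts, deps):
    """Return the message edges that intersect at least one of the N recovery lines."""
    lines = []
    for i in range(len(num_ckpts)):
        # Graph with an n-node on every processor except i (one extra trailing vertex).
        sizes = [n + (j != i) for j, n in enumerate(num_ckpts)]
        lines.append(recovery_line(sizes, deps))

    def intersects(edge, line):
        (i, k), (j, m) = edge
        return k < line[i] and m > line[j]  # assumed strict left/right reading

    return [e for e in deps if any(intersects(e, line) for line in lines)]


# Hypothetical pattern: two processors, three checkpoints each, three logged messages.
m_a = ((0, 0), (1, 2))
m_b = ((1, 1), (0, 2))
m_c = ((0, 2), (1, 2))
kept = non_discardable_logs([3, 3], [m_a, m_b, m_c])
```

In this sample only the log for `m_a` is retained; `m_b` and `m_c` are received after the global recovery line, so they are non-obsolete, yet neither intersects any of the two computed lines.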

There is an interesting difference between checkpoint reclamation and message log
reclamation. While the set of non-discardable checkpoints is determined by the union of the
N recovery lines $\mathcal{RL}(\hat{G} - n_i)$, $n_i \in U$, the set of non-discardable message logs is affected
by the position of each individual recovery line. Fig. 7 illustrates such a difference. The
non-discardable checkpoints a, b, c and d in Fig. 7(a) remain non-discardable in Fig. 7(b)
when e is added to the graph. However, the non-discardable message logs corresponding to
the edges (b, d) and (c, d) in Fig. 7(a) become discardable as the addition of e changes the
positions of $\mathcal{RL}(\hat{G} - n_1)$ and $\mathcal{RL}(\hat{G} - n_2)$.

4 Experimental Results 

Three hypercube programs are used to illustrate the message log reclamation capabilities
and benefits of our algorithm. They are Cell placement, Channel router and QR
decomposition, running on an 8-node Intel iPSC/2 hypercube. Communication traces are collected by
intercepting the “send” and “receive” system calls. Communication trace-driven simulation
is then performed to obtain the results. The execution time for each program is listed in
Table 1. The checkpoint interval is arbitrarily chosen to be approximately one tenth of the
execution time.

Figure 7: The difference between the reclamation of checkpoints and message logs.


Table 1: Execution time and checkpoint interval.

Programs                    Cell placement   Channel router   QR decomposition
Execution time (sec)        324              469              370
Checkpoint interval (sec)   35               40               35


Figs. 8-10 compare our algorithm with the traditional garbage collection algorithm for 
the three programs in terms of the number and size of the retained message logs. Each curve 
shows the remaining space overhead after garbage collection if the algorithm is invoked after 
a certain number of checkpoints have been taken. Since the checkpointing clocks on all nodes
are approximately synchronized, checkpoints #8n through #8(n+1)-1 are taken at about
the same time, which explains the fact that the number of messages is almost constant within
that interval.

The domino effect is illustrated by the constant increase in the number of non-obsolete 
message logs as the total number of checkpoints increases, for example, between checkpoints 
#40 and #64 in Fig. 8(a) and between checkpoints #48 and #88 in Fig. 9(a). The figures
show that our algorithm performs consistently better than the traditional algorithm and is 
particularly effective when the domino effect persists. 


5 Concluding Remarks 


We have shown that some of the non-obsolete message logs in an independent
checkpointing protocol can be discarded because they can never be useful for any possible future
recovery. An algorithm was developed for finding all discardable message logs in order to
minimize the space overhead. Communication trace-driven simulation results for three
hypercube programs showed that the algorithm can be effective in reducing the message log
space overhead for real applications.